Adaptive Partitioning for Very Large RDF Data
Authors
R. Harbi, I. Abdelaziz, P. Kalnis, M. Sahli (King Abdullah University of Science & Technology, Thuwal, Saudi Arabia; e-mail: {first}.{last}@kaust.edu.sa); N. Mamoulis (University of Ioannina, Greece); Y. Ebrahim (Microsoft Corporation, Redmond, WA 98052, United States)
Abstract
State-of-the-art distributed RDF systems partition data across multiple computer nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation, while others apply heuristics that aim to minimize inter-node communication during query evaluation. The latter approach requires an expensive data pre-processing phase, leading to high startup costs for very large RDF knowledge bases. A priori knowledge of the query workload has also been used to create partitions; these, however, are static and do not adapt to workload changes. As a result, inter-node communication cannot be consistently avoided for queries that are not favored by the initial data partitioning. In this paper, we propose AdHash, a distributed RDF system that addresses the shortcomings of previous work. First, AdHash applies lightweight partitioning to the initial data, distributing triples by hashing on their subjects; this keeps its startup overhead low. At the same time, the locality-aware query optimizer of AdHash takes full advantage of the partitioning to (i) support the fully parallel processing of join patterns on subjects and (ii) minimize data communication for general queries by applying hash distribution of intermediate results instead of broadcasting, wherever possible. Second, AdHash monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent ones among workers. As a result, the communication cost for future queries is drastically reduced or even eliminated. To control replication, AdHash implements an eviction policy for the redistributed patterns.
Our experiments with synthetic and real data verify that AdHash (i) starts faster than all existing systems, (ii) processes thousands of queries before other systems come online, and (iii) gracefully adapts to the query load, evaluating queries on billion-scale RDF data in sub-second time.
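The subject-hash partitioning described in the abstract can be illustrated with a minimal Python sketch. It is not AdHash's implementation: the function names (`worker_for`, `partition_by_subject`, `local_subject_star`), the use of CRC32 as the hash function, and the toy triples are all assumptions made for illustration. The sketch shows only why a join on a shared subject can be answered by each worker independently once all triples of a subject are placed on the same worker.

```python
import zlib
from collections import defaultdict


def worker_for(subject: str, num_workers: int) -> int:
    # Hypothetical placement rule: CRC32 is used because it is deterministic
    # across runs (Python's built-in hash() for str is randomized per process).
    return zlib.crc32(subject.encode("utf-8")) % num_workers


def partition_by_subject(triples, num_workers):
    # Assign every (subject, predicate, object) triple to the worker chosen
    # by hashing its subject, so all triples of a subject are co-located.
    partitions = [[] for _ in range(num_workers)]
    for s, p, o in triples:
        partitions[worker_for(s, num_workers)].append((s, p, o))
    return partitions


def local_subject_star(partition, predicates):
    # Evaluate a subject-star pattern (?s p1 ?o1 . ?s p2 ?o2 ...) on a single
    # partition, returning the subjects that have every requested predicate.
    seen = defaultdict(set)
    for s, p, _ in partition:
        seen[s].add(p)
    return {s for s, preds in seen.items() if set(predicates) <= preds}


# Toy data, invented for this example.
triples = [
    ("alice", "worksAt", "kaust"),
    ("alice", "livesIn", "thuwal"),
    ("bob", "worksAt", "kaust"),
    ("carol", "livesIn", "ioannina"),
    ("carol", "worksAt", "uoi"),
]

partitions = partition_by_subject(triples, num_workers=3)

# Each worker answers the star query on its own data; the final answer is
# the plain union of the local answers, with no data exchanged between workers.
answer = set()
for part in partitions:
    answer |= local_subject_star(part, ["worksAt", "livesIn"])
```

Queries that join on anything other than the subject do not have this property under subject hashing, which is the case the abstract addresses with hash distribution of intermediate results and adaptive redistribution.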
Similar resources
DREAM: Distributed RDF Engine with Adaptive Query Planner and Minimal Communication
The Resource Description Framework (RDF) and SPARQL query language are gaining wide popularity and acceptance. In this paper, we present DREAM, a distributed and adaptive RDF system. As opposed to existing RDF systems, DREAM avoids partitioning RDF datasets and partitions only SPARQL queries. By not partitioning datasets, DREAM offers a general paradigm for different types of pattern matching q...
Scaling Queries over Big RDF Graphs with Semantic Hash Partitioning
Massive volumes of big RDF data are growing beyond the performance capacity of conventional RDF data management systems operating on a single node. Applications using large RDF data demand efficient data partitioning solutions for supporting RDF data access on a cluster of compute nodes. In this paper we present a novel semantic hash partitioning approach and implement a Semantic HAsh Partition...
Efficient SPARQL Query Evaluation via Automatic Data Partitioning
The volume of RDF data has increased very fast over the last five years, e.g. the Linked Open Data cloud has grown from 2 billion to 50 billion RDF triples. With its excellent scalability, a cloud computing platform like Hadoop is a good choice for processing queries over large data sets. Previous works on evaluating SPARQL queries with Hadoop mainly focus on reducing the number of joins through c...
PHD-Store: An Adaptive SPARQL Engine with Dynamic Partitioning for Distributed RDF Repositories
Many repositories utilize the versatile RDF model to publish data. Repositories are typically distributed and geographically remote, but data are interconnected (e.g., the Semantic Web) and queried globally by a language such as SPARQL. Due to the network cost and the nature of the queries, the execution time can be prohibitively high. Current solutions attempt to minimize the network cost by r...
Workload-Aware RDF Partitioning and SPARQL Query Caching for Massive RDF Graphs stored in NoSQL Databases
Governments, corporations, startups, open data initiatives and other organizations are increasingly considering RDF and SPARQL in a broad range of information management scenarios. To reduce SPARQL querying times has been the main issue for virtually all the recent RDF triplestores, yet SPARQL caching techniques have not been broadly considered. In this paper we present Rendezvous, a middleware...
Journal title:
- CoRR
Volume abs/1505.02728, issue -
Pages -
Publication date 2015